Conversation
…w-operator into feat/multi-git-sync
All tests passed locally apart from one, which is a known issue and will be tackled separately.

RELEASE-NOTE-SNIPPET: Airflow now supports multiple git-sync instances and thus multiple git repositories (or multiple branches from one repository). Note that the default path has changed while using git-sync to
adwk67
left a comment
Just a couple of initial comments after running the tests.
# will expect 2 (two containers, base and gitsync)
- script: kubectl -n $NAMESPACE get cm airflow-executor-pod-template -o json | jq -r '.data."airflow_executor_pod_template.yaml"' | grep "AIRFLOW_TEST_VAR" | wc -l | grep 2
# will expect 1 (one container, gitsync)
# will expect 6 (2 from the volume declaration + mounts to four containers, base and gitsync-0, gitsync-1 and multi-gitsync)
I only get 5 here: 3 volume mounts (base, git-sync-0, git-sync-1) and 2 from volume declaration. Likewise, probably, with the others.
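The `grep … | wc -l` pipeline quoted above counts matching lines in the rendered pod template. A minimal Python sketch of the same check (the function name and sample template are illustrative, not taken from the operator or its tests):

```python
def count_matching_lines(text: str, needle: str) -> int:
    """Mirror `grep <needle> | wc -l`: count lines containing the needle."""
    return sum(1 for line in text.splitlines() if needle in line)


# Illustrative pod-template fragment: one env entry per container,
# so the expected count equals the number of containers.
pod_template = """\
containers:
  - name: base
    env:
      - name: AIRFLOW_TEST_VAR
  - name: gitsync
    env:
      - name: AIRFLOW_TEST_VAR
"""

assert count_matching_lines(pod_template, "AIRFLOW_TEST_VAR") == 2
```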
# activate DAG
response = requests.patch(
    f"{rest_url}/dags/{dag_id}", headers=headers, json={"is_paused": False}
response_0 = requests.patch(
This step fails for me as the DAGs are imported with errors:
Traceback (most recent call last):
File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
File "/stackable/app/allDAGs/current-0/mount-dags-gitsync/dags_airflow3/pyspark_pi.py", line 41, in <module>
with open(yaml_path, "r") as file:
^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/stackable/app/allDAGs/current-0/dags_airflow3/pyspark_pi.yaml'
Although this path does exist.
Tested with this:
[stackable@airflow-webserver-default-0 ~]$ airflow dags list-import-errors
/stackable/app/lib64/python3.12/site-packages/airflow/providers/common/compat/security/permissions.py:30 RemovedInAirflow4Warning: The airflow.security.permissions module is deprecated; please see https://airflow.apache.org/docs/apache-airflow/stable/security/deprecated_permissions.html
bundle_name | filepath | error
============+==========================================================+=========================================================================================================================
dags-folder | current-1/mount-dags-gitsync/dags_airflow3/pyspark_pi.py | Traceback (most recent call last):
| | File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
| | File "/stackable/app/allDAGs/current-1/mount-dags-gitsync/dags_airflow3/pyspark_pi.py", line 41, in <module>
| | with open(yaml_path, "r") as file:
| | ^^^^^^^^^^^^^^^^^^^^
| | FileNotFoundError: [Errno 2] No such file or directory: '/stackable/app/allDAGs/current-1/dags_airflow3/pyspark_pi.yaml'
| |
dags-folder | current-0/mount-dags-gitsync/dags_airflow3/pyspark_pi.py | Traceback (most recent call last):
| | File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
| | File "/stackable/app/allDAGs/current-0/mount-dags-gitsync/dags_airflow3/pyspark_pi.py", line 41, in <module>
| | with open(yaml_path, "r") as file:
| | ^^^^^^^^^^^^^^^^^^^^
| | FileNotFoundError: [Errno 2] No such file or directory: '/stackable/app/allDAGs/current-0/dags_airflow3/pyspark_pi.yaml'
| |
[stackable@airflow-webserver-default-0 ~]$
I notice you have a different folder structure in your repo. In the one used for the tests we have:
gitFolder: "mount-dags-gitsync/dags_airflow3" and the dags in the pods are here:
/stackable/app/allDAGs/current-0/mount-dags-gitsync/dags_airflow3/ which doesn't match /stackable/app/allDAGs/current-0/dags_airflow3/pyspark_pi.yaml
And I think this is brittle:
yaml_path = os.path.join(
    os.environ.get("AIRFLOW__CORE__DAGS_FOLDER"), "current-1", "dags_airflow3", "pyspark_pi.yaml"
)
as the DAG itself has to know about the nestedness.
Ah yeah, it slipped through. We'd need to teach the whole path to the DAG in the repository I guess.
I understand. We could introduce new env-variables which could be set via the operator like GIT_0 and GIT_1 pointing to the respective repositories. Then instead we could just
yaml_path = os.path.join(
    os.environ.get("GIT_1"), "pyspark_pi.yaml"
)
The user would set the path-to-folder ("mount-dags-gitsync/dags_airflow3" in this case) and we prefix it with the AIRFLOW_DAG_FOLDER path.
For integration tests I think it would maybe be fine to reference like it is currently?
I'm trying with this which seems to work:
import os
from pathlib import Path

#yaml_path = os.path.join(
#    os.environ.get("AIRFLOW__CORE__DAGS_FOLDER"), "current-0", "dags_airflow3", "pyspark_pi.yaml"
#)
dags_folder = Path(os.environ.get("AIRFLOW__CORE__DAGS_FOLDER", "/stackable/app/allDAGs")) / "current-0"
matches = list(dags_folder.rglob("pyspark_pi.yaml"))
if not matches:
    raise FileNotFoundError(f"pyspark_pi.yaml not found under {dags_folder}")
if len(matches) > 1:
    raise RuntimeError(f"Multiple pyspark_pi.yaml files found under {dags_folder}: {matches}")
yaml_path = matches[0]
If you want to access multiple branches of a repository or load DAGs from multiple repositories, you can extend the list shown above.
The default mount-path for git-sync resources is `/stackable/app/allDAGs/current-{i}` where i corresponds to:
I don't think the user needs to know this as it is (or should be?) an internal detail.
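If the mount layout is meant to stay internal, the DAG can avoid referencing it at all by resolving data files relative to its own location instead of via `AIRFLOW__CORE__DAGS_FOLDER`. A minimal sketch (not from the PR; the helper name is made up for illustration):

```python
import os


def sibling_path(dag_file: str, name: str) -> str:
    """Resolve a file next to the DAG file itself, so the DAG needs no
    knowledge of the git-sync mount layout (current-{i}, gitFolder nesting)."""
    return os.path.join(os.path.dirname(os.path.abspath(dag_file)), name)


# Inside a DAG this would typically be called as:
#   yaml_path = sibling_path(__file__, "pyspark_pi.yaml")
```

This sidesteps both the `current-{i}` index and the `gitFolder` nesting, since the checked-out repository keeps the DAG and its data file side by side.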
sshPrivateKeySecretName: git-sync-ssh
----
NOTE: Using DAG's from ConfigMaps will need you to either use celeryExecutors and mount them under `/stackable/app/allDAGs/<name>` or use kubernetesExecutor and mount them somewhere else and change the PYTHONPATH as well as the `AIRFLOW\__CORE__DAGS_FOLDER` accordingly.
Suggestion:
NOTE: Using DAGs from ConfigMaps will require you to either use celeryExecutors and mount them under `/stackable/app/allDAGs/<name>` or use the kubernetesExecutor and mount them somewhere else, changing the PYTHONPATH (if using submodules) and `AIRFLOW\__CORE__DAGS_FOLDER` accordingly.
adwk67
left a comment
Is this a breaking change? Will existing set-ups (DAGs mounted via CMs, single gitsync repos) still work? The versioning test (also using gitsync) does not pass for me (probably as it is still referencing the DAG in the main branch).
symlinks
}

// kubernetesExecutor needs copy since it needs init-container.
Can we give a reason for this (I think it was something to do with the symlink not resulting in "clean" paths, or something...)
container_debug_command(),
"airflow triggerer &".to_string(),
]),
// This essentially doesn't work for KubernetesExecutor

format!(
    "cp -RL {CONFIG_PATH}/{AIRFLOW_CONFIG_FILENAME} {AIRFLOW_HOME}/{AIRFLOW_CONFIG_FILENAME}"
),
// Adding cm as dags within the same AIRFLOW_DAGS_FOLDER may lead to problems, thus checking if exists.
Does this mean that we can't simply have, say, 2 DAGs in a single ConfigMap, which are mounted to AIRFLOW_DAGS_FOLDER, or something else?
Description
Part of #721
This PR adds support for multiple instances of git-sync.
This is done via a symlink onto the current folder of each git-sync instance.
In kubernetesExecutors a `cp -r` replaces the symlink, as it is a one-off anyway and it's easier to transition contents via volumeMounts and an init-container. Integration test 'mount-dags-gitsync' has been changed to use two git-sync resources at different branches of the same repository.
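The symlink fan-in described above can be sketched roughly as follows. This is an assumption-laden illustration based only on the paths quoted in this conversation (`/stackable/app/allDAGs/current-{i}`), not the operator's actual implementation; the function name and checkout paths are made up:

```python
import os


def link_gitsync_instances(dags_root: str, gitsync_checkouts: list[str]) -> list[str]:
    """Symlink each git-sync checkout as current-{i} under the shared DAGs folder,
    replacing any stale link from a previous run."""
    os.makedirs(dags_root, exist_ok=True)
    links = []
    for i, src in enumerate(gitsync_checkouts):
        link = os.path.join(dags_root, f"current-{i}")
        if os.path.islink(link):
            os.remove(link)
        os.symlink(src, link)
        links.append(link)
    return links
```

For the kubernetesExecutor case the PR replaces these symlinks with a `cp -r` in an init-container, since the executor pods are one-off.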
Definition of Done Checklist
Author
Reviewer
Acceptance
`type/deprecation` label & add to the deprecation schedule
`type/experimental` label & add to the experimental features tracker